
    Low Frequency Ultrasonic Voice Activity Detection using Convolutional Neural Networks

    Low frequency ultrasonic mouth state detection uses reflected audio chirps from the face in the region of the mouth to determine lip state, whether open, closed or partially open. The chirps are located in a frequency range just above the threshold of human hearing and are thus both inaudible and unaffected by interfering speech, yet can be produced and sensed using inexpensive equipment. To determine mouth open or closed state, and hence form a measure of voice activity detection, this recently invented technique relies upon the difference in the reflected chirp caused by resonances introduced by the open or partially open mouth cavity. Voice activity is then inferred from lip state through patterns of mouth movement, in a similar way to video-based lip-reading technologies. This paper introduces a new metric based on spectrogram features extracted from the reflected chirp, with a convolutional neural network classification back-end, that yields excellent performance without needing the periodic resetting of the template closed-mouth reflection required by the original technique.
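
    As a rough illustration of the front end and back end described above, the sketch below computes a log spectrogram restricted to an inaudible chirp band and passes it to a small CNN that classifies mouth state. The sample rate, chirp band, frame length and network layout are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of the spectrogram-plus-CNN idea. All parameters (chirp band,
# frame length, network size) are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import spectrogram

FS = 96_000          # assumed sample rate able to capture a ~20-24 kHz chirp band
FRAME = 4096         # one reflected-chirp frame

def chirp_spectrogram(frame, fs=FS, f_lo=20_000, f_hi=24_000):
    """Log-magnitude spectrogram restricted to the inaudible chirp band."""
    f, t, sxx = spectrogram(frame, fs=fs, nperseg=256, noverlap=192)
    band = (f >= f_lo) & (f <= f_hi)
    return np.log(sxx[band] + 1e-10).astype(np.float32)

class MouthStateCNN(nn.Module):
    """Tiny CNN classifying mouth state (closed / partially open / open)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):                 # x: (batch, 1, freq, time)
        return self.classifier(self.features(x).flatten(1))

# Example: classify one received frame (random data stands in for a real capture).
spec = chirp_spectrogram(np.random.randn(FRAME))
logits = MouthStateCNN()(torch.from_numpy(spec)[None, None])
```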

    Efficient multi-standard cognitive radios on FPGAs

    Cognitive radios that support multiple standards and modify operation depending on environmental conditions are becoming more important as the demand for higher bandwidth and efficient spectrum use increases. Traditional implementations in custom ASICs cannot support such flexibility as standards change at an ever faster pace, while software baseband implementations fail to achieve the required performance. Hence, FPGAs offer an ideal platform, bringing together flexibility, performance, and efficiency. This work explores possible techniques for designing multi-standard radios on FPGAs, and examines how partial reconfiguration can be leveraged in a way that is amenable to domain experts with minimal FPGA knowledge.
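
    The sketch below is only a conceptual illustration of the multi-standard idea: a controller picks the least-occupied standard from a spectrum-sensing report and swaps in the corresponding baseband core via partial reconfiguration. The bitstream names and the load_partial_bitstream() helper are hypothetical placeholders rather than a real vendor API.

```python
# Conceptual illustration only: swapping baseband cores for different standards
# via partial reconfiguration, driven by a spectrum-sensing report. Bitstream
# names and load_partial_bitstream() are hypothetical placeholders.
STANDARD_TO_BITSTREAM = {
    "wifi_802_11a": "pr_wifi_ofdm.bit",   # hypothetical partial bitstreams
    "lte":          "pr_lte_scfdma.bit",
    "zigbee":       "pr_zigbee_dsss.bit",
}

def load_partial_bitstream(path: str) -> None:
    """Placeholder for the device-specific partial reconfiguration call."""
    print(f"[PR] loading {path} into the reconfigurable baseband region")

def select_standard(occupancy: dict) -> str:
    """Pick the standard whose band currently shows the least occupancy."""
    return min(occupancy, key=occupancy.get)

def reconfigure(occupancy: dict) -> None:
    load_partial_bitstream(STANDARD_TO_BITSTREAM[select_standard(occupancy)])

# Example: per-band occupancy estimates (0..1) from spectrum sensing.
reconfigure({"wifi_802_11a": 0.7, "lte": 0.2, "zigbee": 0.5})
```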

    Shaping spectral leakage for IEEE 802.11p vehicular communications

    IEEE 802.11p is a recently defined standard for the physical (PHY) and medium access control (MAC) layers for Dedicated Short-Range Communications. Four Spectrum Emission Masks (SEMs) are specified in 802.11p that are much more stringent than those for current 802.11 systems. In addition, the guard interval in 802.11p has been lengthened by reducing the bandwidth to support vehicular communication (VC) channels, and this results in a narrowing of the frequency guard bands. This raises a significant challenge for filtering the spectrum of 802.11p signals to meet the specifications of the SEMs. We investigate state-of-the-art pulse shaping and filtering techniques for 802.11p, before proposing a new method of shaping the 802.11p spectral leakage to meet the most stringent, class D, SEM specification. The proposed method, performed at baseband to relax the strict constraints on the radio frequency (RF) front-end, allows 802.11p systems to be implemented using commercial off-the-shelf (COTS) 802.11a RF hardware, resulting in reduced total system cost.
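
    A minimal sketch of the baseband shaping idea follows: a 10 MHz OFDM-like signal is filtered at baseband and its power spectral density is checked against a stepped emission mask. The sample rate, filter lengths and mask breakpoints are illustrative assumptions and do not reproduce the exact class D SEM limits or the method proposed in the paper.

```python
# Sketch of baseband spectral shaping: filter a 10 MHz-wide signal and check
# its PSD against a simplified stepped emission mask. Mask breakpoints are
# illustrative only, not the exact class D SEM limits from the standard.
import numpy as np
from scipy.signal import firwin, lfilter, welch

FS = 40e6                       # assumed baseband sample rate (4x oversampled channel)
N = 1 << 16

# Stand-in for an 802.11p baseband waveform: band-limited complex noise.
x = np.random.randn(N) + 1j * np.random.randn(N)
x = lfilter(firwin(129, 5e6, fs=FS), 1.0, x)          # confine to a ~10 MHz channel

# Spectral-shaping filter applied at baseband (tap count and cutoff are arbitrary).
shaped = lfilter(firwin(257, 4.8e6, fs=FS), 1.0, x)

def psd_dbr(sig):
    """Two-sided PSD in dB relative to the in-band peak."""
    f, p = welch(sig, fs=FS, nperseg=2048, return_onesided=False)
    return np.fft.fftshift(f), np.fft.fftshift(10 * np.log10(p / p.max()))

def mask_dbr(f):
    """Illustrative stepped emission mask (offsets in Hz, limits in dBr)."""
    off = np.abs(f)
    return np.select([off <= 4.5e6, off <= 5e6, off <= 5.5e6, off <= 10e6],
                     [0.0, -35.0, -45.0, -55.0], default=-65.0)

# Whether the mask is met depends entirely on the shaping filter design.
f, p = psd_dbr(shaped)
print("mask satisfied:", bool(np.all(p <= mask_dbr(f) + 1e-9)))
```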

    Robust sound event recognition using convolutional neural networks

    Traditional sound event recognition methods based on informative front-end features such as MFCCs, with back-end sequencing methods such as HMMs, tend to perform poorly in the presence of interfering acoustic noise. Since noise corruption may be unavoidable in practical situations, it is important to develop more robust features and classifiers. Recent advances in this field use powerful machine learning techniques with high-dimensional input features such as spectrograms or auditory images. These improve robustness largely thanks to the discriminative capabilities of the back-end classifiers. We extend this further by proposing novel features derived from spectrogram energy triggering, allied with the powerful classification capabilities of a convolutional neural network (CNN). The proposed method demonstrates excellent performance under noise-corrupted conditions when compared against state-of-the-art approaches on standard evaluation tasks. To the authors' knowledge, this is the first application of CNNs in this field.
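
    The following sketch illustrates one plausible reading of the energy-triggering idea: spectrogram cells whose energy exceeds an adaptive per-band threshold are kept, and noise-dominated cells are suppressed before classification. The median-plus-MAD threshold rule is an assumption for illustration, not the exact trigger used in the paper.

```python
# Sketch of energy-triggered spectrogram features: keep only time-frequency
# cells above an adaptive per-band threshold so noise-dominated cells do not
# reach the CNN. The median + k*MAD rule is an illustrative choice.
import numpy as np
from scipy.signal import spectrogram

def energy_triggered_features(x, fs, k=3.0):
    f, t, sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
    log_s = np.log(sxx + 1e-10)
    med = np.median(log_s, axis=1, keepdims=True)          # per-band noise floor
    mad = np.median(np.abs(log_s - med), axis=1, keepdims=True)
    mask = log_s > med + k * mad                            # "triggered" cells
    return np.where(mask, log_s, log_s.min()), mask

# Example on synthetic data; the masked spectrogram would then be fed to a
# CNN classifier such as the one sketched earlier in this listing.
features, mask = energy_triggered_features(np.random.randn(16000), fs=16000)
```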

    Robust Sound Event Detection in Continuous Audio Environments

    Sound event detection in real-world environments has attracted significant research interest recently because of its applications in popular fields such as machine hearing and automated surveillance, as well as in sound scene understanding. This paper considers continuous robust sound event detection, meaning the detection of multiple overlapping sound events in different types of interfering noise. First, a standard evaluation task is outlined based upon existing testing data sets for the sound event classification of isolated sounds. This paper then proposes and evaluates the use of spectrogram image features, employing an energy detector to segment sound events, before developing a novel segmentation method making use of a Bayesian inference criterion. At the back end, a convolutional neural network is used to classify detected regions, and this combination is compared to several alternative approaches. The proposed method is shown to achieve very good performance compared with current state-of-the-art techniques.
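
    As a simple illustration of energy-detector segmentation for continuous audio, the sketch below thresholds frame-level energies and returns contiguous active runs as candidate event regions for a downstream classifier. The frame sizes, percentile threshold and minimum duration are illustrative choices, not the settings evaluated in the paper.

```python
# Minimal sketch of energy-detector segmentation: frame energies are
# thresholded and runs of active frames become candidate event regions.
# Thresholding by a percentile of frame energy is an illustrative choice.
import numpy as np

def segment_events(x, fs, frame_len=0.025, hop=0.010, pct=75, min_frames=5):
    n, h = int(frame_len * fs), int(hop * fs)
    frames = np.lib.stride_tricks.sliding_window_view(x, n)[::h]
    energy = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = energy > np.percentile(energy, pct)
    # Collect contiguous runs of active frames as (start_sample, end_sample).
    regions, start = [], None
    for i, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_frames:
                regions.append((start * h, i * h + n))
            start = None
    return regions

# Example: candidate regions in 5 s of (random stand-in) audio at 16 kHz.
regions = segment_events(np.random.randn(5 * 16000), fs=16000)
```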

    Robust Sound Event Classification using Deep Neural Networks

    The automatic recognition of sound events by computers is an important aspect of emerging applications such as automated surveillance, machine hearing and auditory scene understanding. Recent advances in machine learning, as well as in computational models of the human auditory system, have contributed to advances in this increasingly popular research field. Robust sound event classification, the ability to recognise sounds under real-world noisy conditions, is an especially challenging task. Classification methods translated from the speech recognition domain, using features such as mel-frequency cepstral coefficients, have been shown to perform reasonably well for the sound event classification task, although spectrogram-based or auditory image analysis techniques reportedly achieve superior performance in noise. This paper outlines a sound event classification framework that compares auditory image front-end features with spectrogram image-based front-end features, using support vector machine and deep neural network classifiers. Performance is evaluated on a standard robust classification task in different levels of corrupting noise, and with several system enhancements, and is shown to compare very well with current state-of-the-art classification techniques.
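
    The sketch below illustrates a generic spectrogram-image front end of the kind compared in the paper: the log spectrogram is crudely de-noised, resized to a fixed image, and flattened for a classifier, here an SVM as the simpler of the two back ends. The image size, de-noising step and SVM settings are illustrative assumptions.

```python
# Sketch of a spectrogram-image front end feeding an SVM back end. Image size,
# de-noising and classifier settings are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import zoom
from sklearn.svm import SVC

def spectrogram_image_feature(x, fs, shape=(52, 40)):
    f, t, sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
    img = np.log(sxx + 1e-10)
    img -= img.min(axis=1, keepdims=True)               # crude per-band de-noising
    img = zoom(img, (shape[0] / img.shape[0], shape[1] / img.shape[1]), order=1)
    return (img / (img.max() + 1e-10)).ravel()          # fixed-size flattened image

# Toy usage: two synthetic classes, each sound represented by its SIF vector.
X = np.stack([spectrogram_image_feature(np.random.randn(16000), 16000) for _ in range(20)])
y = np.repeat([0, 1], 10)
clf = SVC(kernel="rbf").fit(X, y)
```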

    Comparative whisper vowel space for Singapore English and British English accents

    Whispered speech, although a relatively common form of communication, has received little research attention in spite of its usefulness in everyday vocal interaction. Apart from a few notable studies analysing the main whispered vowels and some quite general estimates of whispered speech characteristics, a classic vowel space determination has been lacking for whispers. Aligning with previously published work that aimed to redress this shortfall by presenting a vowel formant space for whispers, this paper studies Singapore English (SgE) in this respect. Furthermore, by comparing the shift amounts between normal and whispered vowel formants in two different English accents, British West Midlands (WM) and SgE, the study also considers whether the shift amount and direction generalise across two dissimilar accent groupings. The results further suggest that the shift amounts for each vowel are almost consistent for F2 but vary for F1, highlighting the role of accent in any general correlation between normal and whispered vowels for the first formant. This paper presents the results of the formant analysis, in terms of acoustic vowel space mappings, showing differences between normal and whispered speech for SgE, and compares these to results obtained from the analysis of a more standard English accent.
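
    For readers unfamiliar with how a vowel space is obtained, the sketch below estimates F1 and F2 from the LPC roots of a windowed vowel segment, the standard kind of formant analysis such a study relies on. The LPC order and formant search range are common rule-of-thumb values and are not taken from the paper.

```python
# Minimal sketch of the formant analysis behind a vowel-space plot: F1 and F2
# are estimated from the LPC polynomial roots of a windowed vowel segment.
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC coefficients [1, -a1, ..., -ap]."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def first_formants(segment, fs, n_formants=2):
    order = int(2 + fs / 1000)                        # rule-of-thumb LPC order
    a = lpc(segment * np.hamming(len(segment)), order)
    roots = [z for z in np.roots(a) if z.imag > 0]    # one pole per resonance
    freqs = sorted(np.angle(roots) * fs / (2 * np.pi))
    return [f for f in freqs if 90 < f < 4000][:n_formants]

# Example: a synthetic vowel-like segment with two damped resonances.
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
seg = (np.exp(-60 * t) * (np.sin(2 * np.pi * 700 * t) + 0.7 * np.sin(2 * np.pi * 1200 * t))
       + 0.001 * np.random.randn(t.size))
print(first_formants(seg, fs))                        # roughly [F1, F2]
```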

    Improvements on Deep Bottleneck Network based I-Vector Representation for Spoken Language Identification

    Recently, the i-vector representation based on deep bottleneck networks (DBN) pre-trained for automatic speech recognition has received significant interest for both speaker verification (SV) and language identification (LID). In particular, a recent unified DBN-based i-vector framework, referred to as DBN-pGMM i-vector, has performed well. In this paper, we replace the pGMM with a phonetic mixture of factor analyzers (pMFA), and propose a new DBN-pMFA i-vector. The DBN-pMFA i-vector includes the following improvements: (i) a pMFA model is derived from the DBN, which can jointly perform feature dimension reduction and de-correlation in a single linear transformation; (ii) a shifted DBF, termed SDBF, is proposed to exploit temporal contextual information; (iii) a senone selection scheme is proposed to improve the efficiency of i-vector extraction. We evaluate the proposed DBN-pMFA i-vector on the six most confused languages selected from NIST LRE 2009. The experimental results demonstrate that DBN-pMFA can consistently outperform the previous DBN-based framework, and that the computational complexity can be significantly reduced by applying a simple senone selection scheme.
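
    The sketch below shows one plausible reading of the senone selection idea: senone posteriors from the network are accumulated into occupancy counts, and only the most occupied senones are retained when collecting the sufficient statistics for i-vector extraction. The top-k selection rule and all dimensions are illustrative assumptions rather than the exact scheme in the paper.

```python
# Sketch of senone selection before i-vector statistics: keep only the senones
# with the largest total occupancy, shrinking the statistics computation.
import numpy as np

def select_senones(posteriors, keep=512):
    """posteriors: (n_frames, n_senones) network output posteriors."""
    occupancy = posteriors.sum(axis=0)                 # zeroth-order stats per senone
    return np.argsort(occupancy)[::-1][:keep]          # indices of dominant senones

def sufficient_stats(features, posteriors, senone_idx):
    """Zeroth/first-order statistics restricted to the selected senones."""
    p = posteriors[:, senone_idx]                      # (n_frames, keep)
    n = p.sum(axis=0)                                  # zeroth-order statistics
    f = p.T @ features                                 # first-order, (keep, feat_dim)
    return n, f

# Toy example with random stand-ins for bottleneck features and senone posteriors.
feats = np.random.randn(300, 40)
post = np.random.dirichlet(np.ones(3000), size=300)
idx = select_senones(post, keep=512)
n0, f1 = sufficient_stats(feats, post, idx)
```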